The Annals of Applied Statistics 
2011, Vol. 5, No. 1, 486-522 
DOI: 10.1214/10-AOAS374 

© Institute of Mathematical Statistics. 2011 

ORTHOGONAL SIMPLE COMPONENT ANALYSIS: A NEW, 
EXPLORATORY APPROACH 

By Karim Anaya-Izquierdo, Frank Critchley and Karen Vines 

The Open University 

Combining principles with pragmatism, a new approach and ac- 
companying algorithm are presented to a longstanding problem in 
applied statistics: the interpretation of principal components. Fol- 
lowing Rousson and Gasser [53 (2004) 539-555] 

the ultimate goal is not to propose a method that leads au- 
tomatically to a unique solution, but rather to develop tools 
for assisting the user in his or her choice of an interpretable 
solution. 

Accordingly, our approach is essentially exploratory. Calling a vector 
'simple' if it has small integer elements, it poses the open question: 

What sets of simply interpretable orthogonal axes — if any — are 
angle-close to the principal components of interest? 

its answer being presented in summary form as an automated visual 
display of the solutions found, ordered in terms of overall measures of 
simplicity, accuracy and star quality, from which the user may choose. 
Here, 'star quality' refers to striking overall patterns in the sets of 
axes found, deserving to be especially drawn to the user's attention 
precisely because they have emerged from the data, rather than being 
imposed on it by (implicitly) adopting a model. Indeed, other things 
being equal, explicit models can be checked by seeing if their fits occur 
in our exploratory analysis, as we illustrate. Requiring orthogonality, 
attractive visualization and dimension reduction features of principal 
component analysis are retained. 

Exact implementation of this principled approach is shown to pro- 
vide an exhaustive set of solutions, but is combinatorially hard. Prag- 
matically, we provide an efficient, approximate algorithm. Through- 
out, worked examples show how this new tool adds to the applied 
statistician's armoury, effectively combining simplicity, retention of 
optimality and computational efficiency, while complementing exist- 
ing methods. Examples are also given where simple structure in the 
population principal components is recovered using only information 
from the sample. Further developments are briefly indicated. 
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1. Introduction and overview. Principal components are linear combina- 
tions of a set of, say, p commensurable variables with coefficients ('loadings') 
given by eigenvectors of their covariance or correlation matrix S. As such, 
they simultaneously enjoy many optimal properties: see, for example, Jolliffe 
(2002), Chapters 2 and 3. However, to be useful in practice, such components 
often need interpretation in the context of the data studied. Unfortunately, 
optimality is no guarantee of interpretability. Accordingly, principal compo- 
nents may possess optimal theoretical properties, but be of limited practical 
interest. This motivates replacing them by components which are more in- 
terpretable by virtue of being 'simpler' in some sense, albeit at the expense 
of some degree of optimality. 

We begin with a brief overview of existing approaches to this problem, 
further details being available in the references cited. 

1.1. Existing approaches. In a broad sense, simplicity means the appear- 
ance of nice structures in the loadings matrix Q = (qj • • • |q&) which con- 
tains the k < p eigenvectors of interest. Often, the scientist in charge of 
the study would like to see if there are clear-cut patterns reflected in Q 
which help him or her to better understand the meaning of the components 
qjx (r = 1, . . . , k) which it generates. Examples of nice structures include 
the presence of simple weighted averages, contrasts, groups of variables and 
sparseness. However defined, simplicity inevitably implies some loss of opti- 
mality and it is the scientist in charge of the study who needs to calibrate 
the trade-off between simplicity and optimality, as we further comment in 
Section 1.2. 

The oldest approach to simplifying principal components is rotation, ex- 
ploiting the fact that — as with principal component analysis itself — rotation 
of the p original axes (one for each variable) defines new orthogonal co- 
ordinate axes on which the data can be displayed while total variance is 
preserved. This provides attractive visualization and dimension reduction 
features. In particular, there being no double counting of total variance, the 
user can identify and plot the data on just those axes making the largest or 
smallest contributions to it, depending on the focus of scientific interest — 
explaining variability or exploring potential scientific laws (near constant 
linear relations among the variables). 

Only rotation methods are guaranteed to provide new axes which are or- 
thogonal. Nonrotation methods in general lack the attractive features noted 
above, joint visualization of components being impeded by nonorthogonal- 
ity of axes and dimension reduction by loss of the additive decomposition of 
total variance. 

Overall, the rotation approach to simplification seeks more interpretable, 
orthogonal axes while retaining as much optimality as possible. See, for 
example, Chapter 11 of Jolliffe (2002), which provides an excellent overall 
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review of simplification of principal components as of 2002. More recently, 
assuming normality, Park (2005) has proposed a penalized profile likelihood 
method, using varimax as the penalty function, which favors rotation of 
ill-defined components (those whose eigenvalues are close). However, in all 
these methods, the loadings involved are usually real numbers, which means 
that interpretation can still be difficult. 

Another approach to simplification is to target sparsity. The presence 
of many zeroes in Q can be useful for interpretation, for example, when 
dealing with many variables. See, for example, D'Aspremont et al. (2007), 
Farcomeni (2009), Chipman and Gu (2005) and the references therein. One 
class of methods which targets sparseness is that based on the Least Absolute 
Shrinkage Selection Operator (LASSO). See, for example, the papers by 
Trendafilov and Jolliffe (2007), Zou, Hastie and Tibshirani (2006), Sjostrand, 
Stegmann and Larsen (2006) and Jolliffe, Trendafilov and Uddin (2003). 
Although most of these methods lead to orthogonal simplified components, 
combined with the presence of exact zero loadings, the remaining loadings 
are still real numbers, again impeding interpretation. 

Other approaches simplify by imposing specific structures on the original 
data matrix X and can be seen as constrained singular value decompositions. 
For example, the semidiscrete decomposition (SDD) approach of Kolda and 
O'Leary (1998) restricts the loadings to lie in {—1,0, 1}. Again, the nonneg- 
ative matrix factorization (NMF) approach of Lee and Seung (1999) requires 
the original variables to be nonnegative, decomposing X into two nonnega- 
tive factor matrices. More recently, plaid models [see, for example, Lazzeroni 
and Owen (2002)] impose various block structures on X which are useful for 
interpretation in gene expression microarray data. However, this class of 
methods does not require orthogonality of the simplified components, with 
the potential loss of attractive features noted above. 

A more explicitly modeling approach to simplification has recently been 
suggested in Rousson and Gasser (2004). Intrinsically restricted to prin- 
cipal component analysis of a correlation matrix, it assumes a particular 
pattern in the eigenstructure of that matrix in which groups and contrasts 
of variables are forced to appear. Although not always appropriate, it is 
when all variables are positively correlated, the first eigenvector being then 
a weighted average of the variables and, consequently, the remaining eigen- 
vectors being basically contrasts. The loadings obtained are all proportional 
to integers, aiding interpretation. However, the components obtained need 
not be orthogonal, again with the potential drawbacks noted above. 

The approaches in Hausman (1982), Sun (2006) and Vines (2000) are 
similar to the one presented here, in the sense that all three give orthog- 
onal components with loading vectors proportional to integers. Hausman's 
method only allows the loadings to take the values —1, or 1 and so is not 
always able to find a complete set of orthogonal vectors. In contrast, Vines' 
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method produces loading vectors that are proportional to integers via a se- 
quence of pairwise 'simplicity-preserving' transformations which ensure that 
orthogonality is maintained. However, although always proportional to inte- 
gers, the size of the integers is not bounded and may at times be very large. 
A fuller discussion of the method, and its properties, can be found in Sun 
(2006). 

1.2. Interpretability. With others, we note that interpretability is neither 
guaranteed nor amenable to precise mathematical formulation, this latter 
being evidenced by the variety both between and within methods reviewed 
above. 

These remarks have two key methodological consequences. First, whereas 
simplification can help in the vital step of interpretation, we do not ex- 
pect any method to lead to interpretable results in all cases. And second, 
rather than attempt to find a unique optimal simplification in any prede- 
fined sense — in particular, rather than attempt to completely automate the 
trade-off between simplicity and optimality — they provide motivation for 
adopting an essentially exploratory approach which systematically produces 
an ordered range of solutions, from which the user can choose one or more 
preferred solutions. 

Factors that can guide this choice include the following: (a) the criteria 
on which a method is based, (b) subject matter considerations, particular 
to the context of the data set under analysis, and (c) suboptimality with 
respect to exact principal component analysis, including loss of explanatory 
power — or of focus on potential scientific laws — and correlation. Only princi- 
pal component analysis itself can give orthogonal loadings and uncorrelated 
components, and so any other rotation method will always show some degree 
of correlation. 

1.3. Overview of a new approach. Beginning with a synoptic account, 
we give here an overview of our new, exploratory approach. Requiring or- 
thogonality, it is based on three primary criteria: simplicity, angle-accuracy 
and 'star quality' 

1.3.1. A synoptic account. Retaining the attractive visualization and di- 
mension reduction features noted above, the approach to be presented is 
based on rotation to axes which are 'simple' in the sense — adopted hence- 
forth — that each is defined by small integer loadings. It combines principles 
with pragmatism, complementing those already available. Following Rous- 
son and Gasser (2004), 

the ultimate goal is not to propose a method that leads automatically to a 
unique solution, but rather to develop tools for assisting the user in his or her 
choice of an interpretable solution. 
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Accordingly, our approach is essentially exploratory, posing the open ques- 
tion: 

What sets of simply interpretable orthogonal axes — if any — are angle-close to 
the principal components of interest? 

its answer being presented in summary form as an automated visual display 
of the solutions found, ordered in terms of overall measures of simplicity, 
angle-accuracy and star quality, from which the user may choose. 

Here, 'star quality' refers to striking overall patterns in the sets of axes 
found, deserving to be especially drawn to the user's attention precisely 
because they have emerged from the data, rather than been imposed on 
it by (implicitly) adopting a model. Indeed, other things being equal, ex- 
plicit models can be checked by seeing if their fits occur in our essentially 
exploratory analysis, as we illustrate. 

Our approach treats the components of interest equally, reflecting equal 
scientific interest in them. Along with later worked examples, the one that 
follows illustrates the appropriateness of adopting this principle. Adapta- 
tions of our methodology to other scientific contexts — notably, to those 
where interest focuses exclusively on explaining variability — are noted in 
Section 4. 

Again, our approach trades angle-accuracy off against simplicity, with a 
bias toward the latter. Its exact implementation provides an exhaustive set 
of solutions but can be prohibitively hard, the solution space having com- 
binatorial complexity which grows with p, k and N* , the maximum size of 
integer allowed. However, the nature of our approach allows efficient explo- 
ration of this vast space without restriction to any of its particular subsets, 
such as those determined by modeling assumptions. Pragmatically, we are 
able to provide an efficient, approximate algorithm for this computationally 
challenging problem. 

1.3.2. A worked example: Blood flow data. A worked example illustrates 
this new approach. Figure 1, whose construction and terms are described in 
Section 2, summarizes its results on the covariance matrix for four different 
measurements of an index of resistance to flow in blood vessels [see the 
paper by Thompson, Vines and Harrington (1999)]. Here, p = k = A and, as 
throughout the paper, we take the maximum integer allowed (N*) to be 9, 
this corresponding to allowing only single digit representations. 

Three solutions are obtained and ordered as shown, none of them being 
dominant in terms of both simplicity and accuracy. The user is referred first 
to the one 'two star' solution found, Si, also obtained by Vines (2000) and 
by the undeflated form of Chipman and Gu (2005) [recall that Rousson and 
Gasser (2004) cannot be used for covariance matrices]. This two star solution 
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Table 1 

Principal component loadings for the blood flow data qi , . . . , q4 and the correspon 
simplified loading vectors zi, . . . , Z4 for solution Si 



Variable 


qi 


Zl 


Q2 


Z2 


qa 


Z3 


q4 


z 4 


Right doppler 


0.42 


1 


-0.32 


-1 


-0.58 


-1 


-0.62 


-1 


Left doppler 


0.43 


1 


0.30 


1 


-0.55 


-1 


0.65 


1 


Right CVI 


0.55 


1 


-0.65 


-1 


0.43 


1 


0.30 


1 


Left CVI 


0.58 


1 


0.63 


1 


0.42 


1 


-0.31 


-1 


Variance (%) 


58.0 


57.0 


25.9 


23.8 


9.5 


10.5 


6.5 


8.6 



is also the simplest one found in this case, details being shown in Table 1 
along with the original principal components. 

The simplified loadings here have a very clear structure and are easier 
to understand than the continuous ones, so much so, in fact, that it looks 
like we have uncovered nature's design: a main effect, plus three orthogonal 
contrasts. The simplified components being orthogonal, the total variance is 
retained, being redistributed among the components so as to enhance inter- 
pretability. In particular, there is just a little loss in the variance explained 
by the first two components, while the relatively small variability in the last 
two suggests possible underlying regularities. 



♦ Two star 
One star 



0.01 0.02 
<- Accuracy 



0.06 0.07 
Discrepancy -> 



Fig. 1. A graphical summary of the solutions obtained for the blood data by our approach. 
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The user is referred next to the other, here 'one star,' solutions, starting 
with the simpler one. Having the same sign pattern, just different weights, 
they have essentially the same overall interpretation as Si . Successively gain- 
ing accuracy at the cost of some simplicity, the simplified loading vectors 
and percentages of variance explainedfor S2 and S3 are given in Table 2. 

Overall, as this and later examples will show, we can obtain good ap- 
proximations in the sense that only small integers are used, while retaining 
closeness to the original components and exact orthogonality. A distinctive 
feature of our exploratory approach is that the user is provided with an 
ordered set of alternative views of the same data, from which s/he may 
choose. 

We move now to put some flesh on the bones of the synoptic account 
above, noting first that intrinsic interest lies in eigenaxes, not eigenvectors. 

1.3.3. Eigenvectors, eigenaxes and their approximation. Recall that in- 
terest centers on a p x k loadings matrix Q = (q 1 | • • • |q/%) containing the 
eigenvectors of interest. Without loss, these are normalized to unit length 
(||q r || = 1, r = 1, . . . , k), Ai > ■ • ■ > Afe being the corresponding eigenvalues. 
The overall sign of each eigenvector is arbitrary. Rather, interest really cen- 
ters on the ordered set of axes ±Q := (±q 1 | • • • |±q fc ), where we identify any 
pair of nonzero opposed vectors ±q with the axis (line through the origin 
or one-dimensional subspace) ^(q) := {cq: —00 < c < 00} containing them. 

The approach taken here treats the columns of ±Q equally. It retains 
their orthogonality while replacing each eigenaxis a = £(q) by another one 
a = £(z), close to it in angle terms, which is 'simple' in the sense that it 
contains a nonzero vector z with small integer elements. There is no loss 
in taking the highest common factor of the absolute values of the nonzero 
elements of z, denoted hcf(|z|), to be 1. For, if not, we can divide each 
element of z by it, without changing £(z). 

Overall then, ±Q is approximated by ±Z := (±zi| • • • |±Zfc) where Z := 
(zi|---|zfc) belongs to the set Z(p,k) of all p x k integer matrices with 

Table 2 

Integer representations of solutions S2 and S3 for the blood flow data 



S2 S3 



Variable 


Zl 


z 2 


Z.3 


Z4 


Zl 


Z2 


Z3 


Z4 


Right doppler 


1 


-1 


-1 


-2 


2 


-1 


-3 


-2 


Left doppler 


1 


1 


-1 


2 


2 


1 


-3 


2 


Right CVI 


1 


-2 


1 


1 


3 


-2 


2 


1 


Left CVI 


1 


2 


1 


-1 


3 


2 


2 


-1 


Variance (%) 


57.0 


25.9 


10.5 


6.5 


57.9 


25.9 


9.7 


6.5 
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nonzero, pairwise orthogonal columns in each of which the absolute val- 
ues of the nonzero elements are coprime, two members of this set being 
axis-equivalent if they differ, at most, in the overall signs of their columns. 

1.3.4. Four maxims. Our approach is driven by four maxims, adopted 
for specific methodological reasons. Briefly, these are as follows. 

(1) Integers aid interpretation. This maxim speaks for itself: we require 
linear combinations of variables defined by simple vectors since they are 
typically much easier to interpret than the principal components which they 
approximate. Again, exact zeroes and simple averages appear naturally. 

Approximating an eigenaxis a = Z(q) by a simple axis a = l(z) where 
hcf(|z|) = 1, we call z an integer representation of a and the maximum 
absolute value of its elements the complexity of z — interchangeably, of a — 
denoting these complexities by compl(z) = compl(a). 

Other things being equal, we seek to keep the complexity of each a r 
(r = 1, . . . , k) as low as possible. 

(2) Be angle accurate (for the k eigenvectors of interest). By keeping each 
approximating vector angle-close to its exact counterpart, we ensure that we 
do not lose potentially meaningful individual eigenvectors and that overall 
optimality is maximally retained. This is consistent with our principle of 
equal treatment of all the eigenaxes of interest, while providing a natural, 
operational measure of discrepancy, both for each axis separately and — it 
turns out — overall. 

More specifically, we measure the discrepancy with which a simple axis 
a = ±z approximates an eigenaxis a = ±q by the acute angle 



between them, this being a (geodesic) distance measure between axes. Equiv- 
alently, for reporting purposes, we may use the accuracy measure 



this taking values in [0, 1] . 

It turns out that, when approximating each of a set of axes, the greater 
the minimum angle-accuracy attained overall, the closer the original and 
approximating sets are in terms of a natural measure of distance (see Ap- 
pendix A). 

(3) Be biased toward simplicity. It is always possible to approximate with 
reasonably high accuracy a single p-dimensional axis £(q) by a simple axis 
of low complexity. Figure 2 shows, for different values of p and cos(#), the 
empirical distribution (based on 10,000 independent replications) of the min- 
imum complexity Ni(9) required for there to be a simple axis having accu- 
racy greater than cos(#) when £(q) is sampled from the uniform distribution 





accu(a, a) := cos(d(a, a)) 
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1 2 3 4 5 6 



1 2 3 4 5 6 



1 2 3 4 5 6 7 



Fig. 2. Empirical distribution of minimum complexities Ni(6) in simple approximations 
to p- dimensional space. 



over the set of all possible axes [see Fang and Li (1997)]. Clearly, without 
orthogonality restrictions, accurate approximations of axes tend not to be 
very complex. 

However, there is a clear trade-off between simplicity and accuracy: highly 
accurate approximations usually have high complexity, making interpreta- 
tion more difficult. In general, we choose the simplest possible axis that is 
accurate enough, this bias toward simplicity being, in effect, a bias toward 
interpretability. In other words, in case of conflict, we favor maxim 1 over 
maxim 2. 

(4) Orthogonality brings benefits. Primarily, we choose orthogonality be- 
cause it aids interpretation. Our rotation approach enjoys the general visu- 
alization and dimension reduction features recalled above. Although none of 
these additional features is either targeted or imposed, sparsity, contrasts, 
simple relations between components and groups of variables may all emerge 
as a consequence of using orthogonality combined with integer coefficients. 
Orthogonality is also useful at several stages of the development, as we note. 
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1.4. Organization and running example. Section 2 develops these max- 
ims into a methodology, the running example below being used for illustra- 
tion throughout. The reader interested primarily in how this new approach 
performs may wish to skip this development and go straight to Section 2.5, 
where its results are summarized. Further examples are given in Section 3. 
Section 4 gives a short discussion of complements and extensions. Technical 
and computational details are given as Appendices. 

The running example used is based on Table 3 which shows the unit length 
eigenvectors (rounded to 2 decimal places) of the sample correlation matrix 
for a data set consisting of the scores achieved by 88 students in p = k = 5 
tests, a combination of open- and closed-book exams [Mardia, Kent and 
Bibby (1979)]. Thus, the first principal component is a weighted average of 
all the different subject scores, while the other principal components can 
be interpreted as contrasts. However, more detailed interpretation of the 
principal components, particularly those other than the first, is not easy. 

Throughout, TP denotes the set of all p x 1 vectors with integer elements — 
positive, negative or zero — and T^ the same set with the zero vector re- 
moved. Replacing integers by real numbers, the corresponding sets are de- 
noted MP and M^ p \ respectively. 

2. A new approach. 

2.1. A sequential approach. Operationally, we approximate the k eige- 
naxes of interest sequentially. The order in which we do this matters, for 
two principal reasons: earlier approximations restrict the approximations 
available for later eigenaxes and, hence, their maximum possible achievable 
accuracy. 

To illustrate these points consider, say, the 'forwards' 1 to /c order from 
high to low eigenvalue. When dealing with ct\, there are no orthogonality 
restrictions and we seek an approximation a\ to it in the set Mi of all simple 
axes in MP. In contrast, for each r E {2, . . . , k}, we seek an approximation 



Table 3 

Principal component loadings for the exams data 





qi 


q2 


qa 


q4 


qs 


Mechanics (closed) 


0.40 


-0.65 


0.62 


0.15 


-0.13 


Vectors (closed) 


0.43 


-0.44 


-0.71 


-0.30 


-0.18 


Algebra (open) 


0.50 


0.13 


-0.04 


0.11 


0.85 


Analysis (open) 


0.46 


0.39 


-0.14 


0.67 


-0.42 


Statistics (open) 


0.40 


0.47 


0.31 


-0.66 


-0.23 


Variance (%) 


63.6 


14.8 


8.9 


7.8 


4.9 
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Table 4 

Integer representations for the examinations data with # = 7r/4 



Variable 


Z!(0) 


z 2 (0) 


^3(0) 


2 4 (0) 


z 5 (9) 


Mechanics (closed) 


1 


1 


1 





1 


Vectors (closed) 


1 


1 


-1 





1 


Algebra (open) 


1 











-4 


Analysis (open) 


1 


-1 





1 


1 


Statistics (open) 


1 


-1 





-1 


1 


Accuracy 


0.997 


0.973 


0.9375 


0.937 


0.974 


Max accuracy ||q r || 


1 


0.999 


0.99 


0.95 


0.97 


Variance (%) 


63.3 


14.4 


8.9 


7.9 


5.5 



a r to a r within the set Ai r of all simple axes in W orthogonal to each of 

Thus, for the exams data, A4± is the set of all axes generated by vectors 
in while, for example, taking zi = (1, 1, 1, 1, 1) T , £((1, 1,0, -1, -1) T ) is 
a member of M2, but 1,0, 0, — 1) T ) is not. 

The second point is clear geometrically. The angle-closest axis to a or- 
thogonal to ai,..., a r _i is its projection onto the orthogonal complement 
of their span. This restricts the maximum accuracy that can be achieved. 
For, if q^ - is the orthogonal projection of the unit vector q r onto the orthog- 
onal complement of Spanjzi, . . . , z r _i}, some straightforward trigonometry 
shows that any approximation a G M r satisfies 

accu(a r ,a) = accu(a r ,£(q^))accu(£(q^r),a) 

= \\c^r\\accu(£(q r ),&), 

so that accu(a r , a) < ||q^||, equality holding if and only if a = £(q^r) 
[which requires £{c^r) to be simple]. Thus, over A4 r , not every possible ac- 
curacy is achievable for a r (r > 1), although no such upper bound applies 
to accu(£(q^r), a). 

For the exams data with a\ = £((1, 1, 1, 1, 1) T ), the projection of q2 onto 
the orthogonal complement of di is q^- = (—0.63, —0.42, 0.15, 0.41, 0.49) T 
(to 2 decimal places). Since ||q2 || = 0.999, there is no approximation to 
«2 orthogonal to 6l\ which can achieve an accuracy bigger than this. In 
particular, «2 = ^((1,1,0, —1,—1) T ) has an accuracy of 0.973 with respect 
to 02, while its accuracy with respect to ^(q^) is slightly higher, being given 
by accu(a2,d2)/||q2"|| = 0.973/0.999 ~ 0.974. Similar information for other 
axes is given in Table 4. 

Accordingly, to treat all axes of interest equally, we would in principle con- 
sider all k\ possible orders. In practice, this can be too many. Pragmatically, 
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restricting attention to just the following four orders has been found to work 
well. Together, they combine speed, accuracy and a balance between prior- 
itizing largeand small eigenvalues, the two 'next-best' orders incorporating 
an obvious greedy heuristic: 

Forwards (F): Take the eigenaxes in decreasing order of their eigenvalues. 

Backwards (B): Take the eigenaxes in increasing order of their eigenvalues. 

Next-best forwards (NF): Take first the eigenaxis with the largest eigen- 
value and then, sequentially, the one with the largest maximum possible 
achievable accuracy, ||q^"||, among those remaining. 

Next-best backwards (NB): Take first the eigenaxis with the smallest eigen- 
value and then, sequentially, the one with the largest maximum possible 
achievable accuracy, ||q^||, among those remaining. 

Whichever order is used to obtain them, the approximations found are re- 
ported in the same 1 to k order. 

To describe our approach in more detail, it suffices to consider a single, 
fixed order. We use the forwards order below. 

Note that when all eigenaxes are of interest (k = p), there is no choice to 
be made when approximating the final axis, there being a unique simple axis 
in M p satisfying the (p — 1) orthogonality requirements. In particular, the 
accuracy and complexity of the pth. axis approximation cannot be directly 
controlled. However, if the first {p — 1) are accurate, then so too is the last. 
Again, in general, the simpler the first (p— 1) approximations are, the simpler 
the last is. Overall, then, the number of axes for which approximations are 
sought (rather than forced by previous approximations) is k := min(/c,p— 1). 

2.2. Approximation for a given angle- accuracy. 

2.2.1. Paradigm: Best 6 -accurate simple approximation. We describe here 
the approximation paradigm at the heart of our approach. 

Recall that our approach favors simplicity over accuracy. Accordingly, 
subject to being accurate enough — while orthogonal to previously approx- 
imated axes — we seek the simplest possible approximation to each axis in 
turn. If there is more than one such axis, we choose the most accurate. 
More precisely, we adopt the paradigm: for a given angle 6, and for each 
r = 1, . . . , k in turn, seek the 'best ^-accurate simple' approximation & r (0) 
to a r in the following sense. 

For any 9 G (0,7r/2), we say that an axis a is ^-accurate for a r if it is 
within an angle 6 of it — that is, if accu(a,a r ) > cos(0). As we have just 
seen, there are no such axes in M. r unless cos(#) < ||q^-||, so we always make 
this requirement. 
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Again, we denote by N r (8) the smallest value of N G {1,2, . . .} for which 
there is a ^-accurate axis in Ai r having complexity N. Thus, the 'cone' 

C r {9) := {a £ M. r : accu(a, a r ) > cos(0), compl(a) = N r (9)} 

comprises all those axes in .A4 r with the minimal possible complexity N r (0) 
subject to being within an angle 9 of a r . For given 9, we define 'the best 
^-accurate simple' approximation a r {9) to a r as the axis in C r {9) closest to 
a r . That is, a r {9) is the closest of all the simplest possible, 9 -accurate axes 
in A4 r . Either of the two possible integer representations of dt r {9) will be 
denoted by i r {9). 

Finding a r (9) can be a hard combinatorial optimization problem, espe- 
cially when the dimension p is large. Therefore, to avoid the combinatorial 
complexity, we propose an algorithm to approximate a r {9) which, after a re- 
ordering of the variables, involves a computing effort linear in p, for use when 
exact calculations are prohibitive. We briefly describe such an algorithm in 
Appendix B. 

We call cos(#) the minimum accuracy required for the approximation to 
a. r . We use the same value for each of the k eigenaxes for which approxima- 
tions are sought, denoting by S(9) := (ai(9), . . . , ak(9)) the full set of approx- 
imations obtained. To measure the overall closeness of S(9) to (ai, . . . ,«fc), 
we use the minimum of the k accuracies attained {accu(a r , a r (9))}^ =1 , which 
we denote by MA(S(9)). As noted above, the larger this is, the smaller a 
natural measure of overall distance between these two ordered sets of axes 
(see, again, Appendix A). 

2.2.2. Tuning parameters. Our approach uses the tuning parameters N* 
and 9*, described here, its results being typically less sensitive to the choice 
of N* due to its bias toward simplicity. A third and final tuning parameter 
e, introduced for operational convenience, is described in Section 2.3. 

To facilitate interpretation, we require iV < N* , taking the single digit 
default iV* = 9 in all calculations reported here. Thus, in practice, it may 
not be possible to complete the set of approximations S(9), as some N r {9) 
may be found to exceed N* . A similar effect occurs in Hausman (1982) 
where, in effect, iV* = 1. 

As we want to stay close to the original eigenaxes, we require 9 <9* for 
some < 9* < vr/4, values of 9* greater than 45° clearly allowing poor ap- 
proximations. Thus, overall,_we have the following bounds on the accuracies 
attained for each r = 1, . . . ,k: 

(2.1) cos(6»*) < cos(0) < accu(a r ,a r (9)) < ||q^-||, 

where, for the first axis, we trivially have = qi, so that ||q^- 1| = 1. For 
most purposes, we recommend taking 9* = 7r/4, an exhaustive account of all 
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angles smaller than 9* being provided by considering an automated sequence 
of angles, as described in Section 2.3. This choice of 9* has the advantage 
that no potentially useful approximations are ruled out of consideration a 
priori. Rather, the user is free to draw the line regarding acceptable accuracy 
in the light of all the potentially useful solutions found. 

2.2.3. Running example revisited. We illustrate the above developments 
using our running example, with 9 = 9* = 7r/4. 

For the exams data, there is a (7r/4) -accurate axis for ct\ with complex- 
ity one; that is, N\(tt/£) = 1. Further, out of all axes of complexity one, 
^((1,1,1,1,1) T ) is the closest to «i. Therefore, zi(vr/4) = (1,1,1,1,1) T is 
an integer representation of the best (7r/4)-accurate simple approximation 

to CL\. 

Here, ^2(^/4) is also 1, there being many (vr/4) -accurate axes for 012 
with complexity one orthogonal to di(-7r/4), including £((1, 1,0, — 1, — 1) T ) 
and ^((1,0, 0, 0, — 1) T ). Of these, we prefer the former, their accuracies being 
0.973 and 0.789, respectively. In fact, it can be shown that 1,0, -1, -1) T ) 
is the best (7r/4)-accurate simple approximation to 02- 

Again, ^3(^/4) and are also 1, integer representations of the 

corresponding best (7r/4)-accurate simple approximations being given in Ta- 
ble 4 alongside zi(7r/4) and Z2(vr/4). An extra decimal place is used in re- 
porting accu (03, d3(-7r/4)) to show where the minimum accuracy is attained. 

As noted at the end of Section 2.1, there is no choice about £5(71-/4). 
However, illustrating the general points made there, a r {ir/4) being close to 
a r for each r = 1, . . . ,4, ^(tt/A) is also close to as (having an accuracy of 
0.974), while the relative simplicity of reflects that of 6t\ to 0:4. 

2.3. Effect of varying the minimum accuracy required. When 9 = it/ 4 the 
approximations S(9) typically have low complexity overall and so can usually 
be interpreted. Unless all the eigenaxes are already simple, we might expect 
the overall complexity of the approximations to steadily increase with the 
minimum accuracy required. However, it turns out there is no straightfor- 
ward relationship between the complexity of the approximations and 9. This 
nonmonotone behavior of the approximations S(9) when cos(#) increases is 
due to the discreteness inherent in our approximations. Restricting the el- 
ements of the integer representations to be coprime is mainly responsible 
for this, division by a highest common factor greater than 1 always being a 
possibility. 

The net effect is that it is not possible to fully predict the qualitative 
behavior of S(9) as 9 varies. Accordingly, instead of attempting to find an 
optimal value of cos(#) under some criterion, we vary the value of 9 so as to 
explore all possible sets of approximations. The different sets of orthogonal 
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axes thereby obtained offer different views of the same data set, giving the 
user more scope for interpretation. 

The good news is that it is only necessary to explore a discrete set of values 
of 9. To see this, we introduce the following notation. For any < 9 < 9* , 
we denote by 

S(0) := (oi(0), . . . , a~ m (9)), where 1(9) < k, 

the ordered set of approximate axes obtained among the k = mm(k,p — 1) 
sought. This set is complete [k(0) = k] unless there is a first k(9) < k with 
-^jfc(0)+i(^) f° un d to be greater than N*. When S(9) is complete, so too is 
the full set of k approximate axes S(9), being given by 

g(Q) = {S(9), ifk<p, 
{) \(S(9),a p (9)), ifk=p, 

where a p (9) is the unique simple axis in W orthogonal to the (p— 1) axes in 
S(9). Otherwise, S(9) itself is incomplete, and so not reported. In all cases, 
the minimum accuracy attained among the axes in S(9), denoted MA(S(9)), 
satisfies MA(S(9)) > cos(0), by (2.1). For any < 9 < 9*, defining 9+ < 9 by 

cos(# + ) = MA(S(0)), 

it follows that the same set of approximations is obtained [S(9] = S(9')] for 
all smaller angles 9' in the range (9 + ,9) determined by 

cos(#)<cos(#')<cos(#+), 

but that change happens at the more accurate end of this range, (9 + -accuracy 
precluding S{9) = S(9 + ). 

Thus, to fully explore the range of possible approximations, it is sufficient 
to consider the strictly decreasing sequence of angles 9^ 2 \ . . . defined by 

(2.2) 9^:=9* and 9^ := (9^) + (n>l). 

In practice, for operational convenience, we stop when the minimum accu- 
racy required cos(#) reaches (1 — e) for some small tuning parameter e. In 
general, no simple solutions are missed by doing this, approximations with 
very high minimum accuracy required usually being very complex. Experi- 
ence has shown that a value of e = 0.01 gives satisfactory results, while also 
keeping the computations fast. 

Key features of the gelation between consecutive sets of approximations 
obtained, S(9^) and S(9^ n+1 ^), now follow. Let 

(2.3) r n :=arg min accu(a r ,a r (9^)) 

l<r<fc(6»N) 
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Table 5 

Integer representations for the examinations data with cos(#' 3 ') =0.9375 



Variable 


Z!(0) 


z 2 (0) 


2a (6) 


2 4 (6») 


2s (6>) 


Mechanics (closed) 


1 


1 


2 


1 


1 


Vectors (closed) 


1 


1 


-2 


-1 


1 


Algebra (open) 


1 











-4 


Analysis (open) 


1 


-1 


-1 


2 


1 


Statistics (open) 


1 


-1 


1 


-2 


1 


Accuracy 


0.997 


0.973 


0.980 


0.979 


0.974 


Variance (%) 


63.3 


14.4 


8.9 


7.8 


5.5 



indicate the first approximation which changes from n to n + 1 . Earlier ap- 
proximated axes do not change as {a r (9^)} 1 ^l 1 are already #[ n+1 l-accurate. 
However, for a Tn an approximation strictly more accurate than a rn (9^) 
must be sought. Further, if k = p while S(9^) and S(9^ n+1 ^) are complete, 
the orthogonality restrictions imply that the subspace generated by the re- 
maining approximate axes is the same for S(9^- n+1 ^) as it is for S(9^). That 
is, 

span{z r (6> [n+1] ) : r = r n , . . . ,p} = span{z r (6» M ) : r = r n , . . . ,p}. 

In other words, we are obtaining a different, more accurate, orthogonal sim- 
ple basis for the same subspace. 

With 6>M = 0*=tt/4, the minimum accuracy required is l/y/2 » 0.7071. 
For the exams data, the minimum accuracy attained in this case is for the 
fourth eigenaxis, so that n = 4 and cos(6>t 2 l) = 0.937 (see Table 4). We there- 
fore have at once that a r (9™) = a r (9^) for r = 1,2 and 3. However, it 
is not possible to find an improved accuracy approximation 0:4 (0^) with 
complexity at most iV* = 9, so that k(9^) = 3 and S(9^) is incomplete. 
Without further calculation, Table 4 gives r 2 = 3, cos^) = 0.9375 and 
a T (9^) = &r(eW) = a r (0 [1] ) for r = 1 and 2. 

In fact, S(9^) is complete, corresponding integer representations being 
reported in Table 5. The increase in minimum accuracy required in going 
from 6® to 0P] is very small but, due to discreteness effects, results in a 
drop in the complexities of the third and fourth axis approximations down 
below N* =9. The newly approximated axes {dv(0®)}>=3 span the same 
subspace as {ct r (6 , W)}^ =3 . Further, az,(9^) = a 5 (0[ 1 ]) precisely because, in 
this example, {a 3 (0 [3] ), a 4 (0 [3] )} and {a 3 (# [1] ), a 4 (0 [1] )} 

span the same two- 
dimensional subspace. The S(9^) pair of axes here are now almost as simple 
as those for 5(0M) but more accurate, striking a different simplicity-accuracy 
trade-off. Comparing Tables 4 and 5, this example also illustrates that mov- 



ORTHOGONAL SIMPLE COMPONENT ANALYSIS 



17 



ing to a new simplicity-accuracy trade-off need not change the variances 
explained by each axis in any material way. 

From cos($M) = 0.973 onward, it is not possible to find complete sets of 
approximations S(8) with complexity at most N* = 9. Thus, for the for- 
wards order of approximation, Tables 4 and 5, detailing S(9^) and <S(#[ 3 ]), 
respectively, together cover the full range < 9 < 9* = ir /4. 

2.4. Automated visual display of solutions. Based on 5, the complete set 
of solutions (sets of approximate axes) S(8) found by one or more of the four 
orders of approximation described in Section 2.1, the user can now proceed 
to answer the open question posed at the outset: 

What sets of simply interpretable orthogonal axes — if any — are angle-close to 
the principal components of interest? 

In principle, an overall informed choice requires the user to compare all solu- 
tions regarded as angle-close in terms of a range of factors, including subject 
matter considerations, as described in Section 1.2. Only the individual user 
can calibrate the various trade-offs involved and different users will, quite 
reasonably, choose different (numbers of) solutions. 

In practice, the work involved can be substantial and, to help the user 
make this choice, we provide a summary automated visual display in which 
the solutions found are ordered in terms of overall measures of star quality, 
simplicity and accuracy. Tabular information for each solution is presented 
to the user in this order. In describing this automated display here, we 
emphasize that — although clearly principled — this order of solutions does 
not, indeed cannot, presume to be the preference order for any particular 
user. 

2.4.1. Accuracy-simplicity scatterplot. Prioritizing our main criteria of 
simplicity and accuracy, each solution S is plotted at a point in the positive 
quadrant with the following coordinates. Horizontally, we use the discrep- 
ancy measure 

discr(S) = 1 - MA(S), 

a natural measure of (squared) distance between the eigenaxes (±q 1 | • • • |±q&) 
and S. The smaller discr(S), the more accurate S. Vertically, we use overall 
complexity measure 

compl(S)=N max (S) + X(S), 
where N max (S) is the maximum complexity of the axes in S, the term < 

\/V V S 2 l(pk) i 

X(S) := — — h r fe ~ < 2 being included to further discriminate between 
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Fig. 3. Solution set S for the exams data. 



solutions with the same maximum complexity. The smaller compl(S), the 
simpler S. 

Figure 3 shows the corresponding scatterplot for the exams data where, 
overall, 12 different solutions were obtained. The numbering of the solutions 
reflects a particular principled order described below, together with the plot 
symbols used. In particular, the forwards solutions S(9^) and S(6^) dis- 
cussed above appear here as Si and S7, respectively. 

2.4.2. Minimal and dominated solutions. Low values of both discr(S) 
and compl(S) are clearly desirable, but cannot usually be simultaneously 
achieved. For example, Figure 3 shows that there is a trade-off for the exams 
data, with no solution attaining the smallest value of both these coordinates. 
However, the five solutions joined by straight lines are visibly special, the 
rectangle formed by each of them with the origin containing no other solu- 
tions. For any set of solutions, we call these the minimal solutions — those for 
which no lower value of either coordinate can be found without increasing 
the other. Thus, here, among S3, S4, S5, Sq and S12, the simpler solutions 
are less accurate, and the more accurate solutions are less simple. 

There is always at least one minimal solution, usually more. Together, 
they form the lower-left boundary of the scatterplot whose shape reflects, 
in any particular case, the trade-off between simplicity and accuracy. All 
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Table 6 

Integer representations of the two star solutions for the exams data 



Variable 






Si 










s 2 






Zl 


Z2 


z 3 


z 4 




Zl 


Z2 


z 3 


z 4 


Z5 


Mechanics 


1 


1 


1 





1 


1 


1 


1 


1 





Vectors 


1 


1 


-1 





1 


1 


1 - 


-1 


1 





Algebra 


1 











-4 


1 


-1 





1 


1 


Analysis 


1 


-1 





1 


1 


1 


-1 





1 - 


-1 


Statistics 


1 


-1 





-1 


1 


1 








-4 





Accuracy 


0.997 


0.973 0.9375 


0.937 0.974 


0.997 


0.802 


0.9375 


0.729 


0.897 


Variance (%) 63.3 


14.4 


8.9 


7.9 


5.5 


63.3 


12.1 


8.9 


9.9 


5.8 



other solutions are dominated by a minimal solution. For example, here, Si 
and 5V are dominated by S4, with S4 being simpler and more accurate than 
both. 

2.4.3. Star quality solutions. In general, focusing only on solutions which 
lie in the minimal set does not necessarily capture all clearly interpretable 
solutions. Other, dominated, solutions may possess 'star quality' in the sense 
that there are striking overall patterns in the set of approximate axes found, 
deserving to be especially drawn to the user's attention. For example, Si 
(Table 4, repeated here as the left-hand part of Table 6) has a very clear 
and interpretable structure and so deserves to be brought early to the user's 
attention, even though it does not lie in the minimal set. For many users 
this increased interpretability is likely to be worth the cost in terms of dis- 
crepancy and overall complexity. 

We recognize that interpretability is a subjective concept. Rather than 
attempting a quantification, we use a star rating system to indicate the 
degree to which a solution conforms with one of a predefined set of clear 
structures: 'two star' solutions conform to the clearest structures and 'one 
star' to the next clearest, while 'unstarred' solutions do not conform to any 
of the predefined structures. 

Here, we use six predefined structures, these being one and two star ver- 
sions of three mutually exclusive types, denoted A, B and C, described next. 
There are clear points of contact with the work of Rousson and Gasser, sum- 
marized in the following Section 2.4.4. 

Let Z = (zi I • • • |zfe) be a matrix of integer representations of S. Each 
structure used requires that the p variables can be partitioned into b > 1 
blocks, where each block labels the set of nonzero (by convention, positive) 
elements of a single-signed column z r . If b < k, orthogonality entails that 
the remaining (k — b) columns of Z are contrasts — that is, have elements 
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of both signs. The within-block condition W-B [condition 3 in Rousson and 
Gasser (2004)] holds if the nonzero elements of each contrast occur within 
a single block. 

In this terminology, the three mutually exclusive types of predefined struc- 
ture are as follows: 

A: b = 1 — that is, an overall (possibly, weighted) mean, plus orthogonal 
contrasts. 

B: b > 1 and W-B holds, so that each block has type A structure. 
C: b > 1 and W-B does not hold. 

The type of each starred solution is noted in its table of information but, 
for visual clarity, not in the accuracy-simplicity scatterplot. 

We call z r parsimonious if N$, the number of distinct nonzero elements 
it contains, is small. The more parsimonious a starred solution, the clearer 
its structure. Accordingly, we award two stars when it is as parsimonious as 
possible of its type, and one star otherwise. Orthogonality entails that, for 
each type, two star solutions are precisely those which obey the following 
two conditions: 

max{iVf : z r defines a block} = 1 and 
max{iV| :z r defines a contrast} = 2. 

For example, adopting an obvious notation, an A** solution has a simple 
arithmetic mean, plus a set of orthogonal contrasts in each of which the 
nonzero elements comprise m times a value n, and n times a value — m, 
whereas an A* solution has either an unequally weighted mean, or a contrast 
not of this form. 

Examples of the six possible starred structures are given below, the A* 
example being S3 in Figure 3 and detailed in the left-hand part of Table 7. 



Structure type 



A 


B 


c 


Two star 


/1 1 1 1\ 
1 1-1 1 ' 
1000-4 
1-1 1 1 , 

\i -1 -1 1/ 




/1 1 o\ 
10-1 
0101 

Vo 1 0-1/ 


1 1 


10-1-1 
01 1-1 
\o 1 -1 1/ 




One star 


/3 -1 1 0\ 
3-1-1 ' 
2 10 0-2 
2 1 1 1 , 

\2 1 0-1 1/ 


1 1 


{10 2 0\ 
2 0-1 
1 2 

\0 2 0-1/ 


1 1 


(10 2 2\ 
2 0-1-1 
1 2-2 

^0 2-1 1/ 
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Table 7 

Integer representations of the two best one star approximations for exams data 



Variable 






Si 










s 2 






Zl 


Z2 


z 3 


Z4 


z 5 


i\ 


Z2 


Z3 


Z4 


Z5 


Mechanics 


3 


-1 


1 








1 


3 


2 


1 





Vectors 


3 


-1 


-1 








1 


3 - 


-2 


-1 





Algebra 


2 


1 





- 


-2 


1 


-2 





- 


-2 


Analysis 


2 


1 





1 


1 


1 


-2 


-1 


2 


1 


Statistics 


2 


1 


- 


-1 


1 


1 


-2 


1 


-2 


1 


Accuracy 


0.996 


0.928 


0.937 


0.937 


0.959 


0.997 


0.956 


0.9£ 


S 0.978 


0.959 


Variance (%) 60.2 


17.3 


8.9 


7.9 


5.7 


63.3 


14.2 


8.9 


7.8 


5.7 



2.4.4. Empirical support for assumed models: Points of contact with Rous- 
son and Gasser. Our approach seeks solutions supported by the data in the 
sense that no modeling assumptions are imposed, apart from our axes be- 
ing orthogonal and containing vectors of integers. Therefore, if an optimal 
solution under some modeling assumptions is produced by our analysis, this 
provides empirical evidence in favor of such a model. 

We develop this general point here with respect to the Rousson and Gasser 
method, as reported in Rousson and Gasser (2004), with which there are 
clear points of contact, recalling that it applies to correlation matrices only. 

A key point is that two star structures emerging from our essentially ex- 
ploratory analysis of the data correspond to assumed structures optimally 
fitted to it in Rousson and Gasser (2004), with the added condition of or- 
thogonality between all pairs of approximate axes. Accordingly, solutions 
generated by Rousson and Gasser (2004) can only coincide with ours when 
they are orthogonal, in which case they are two star solutions. In particular, 
one star solutions — involving weighted means and/or contrasts not of the 
A** form noted above — cannot arise in such an analysis. 

For any given number of blocks b, the scalar target function optimized 
in Rousson and Gasser (2004) — the 'corrected sum of variances' [see the 
paper by Gervini and Rousson (2004)], denoted here by optim(S) — can be 
used whether components are orthogonal or not. Although not optimized 
for in our approach, its value can be calculated for each of our solutions and 
compared to the best value found in Rousson and Gasser (2004). However, its 
maximization reflects an exclusive interest in explaining variability, whereas, 
being as interested in exploring potential scientific laws, our approach treats 
all eigenvectors of interest equally. 

Overall, the two methods will give complementary results, agreement only 
being expected when there is strong empirical evidence of an orthogonal two 
star structure underpinning the variability in the data. 
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2.4.5. A total order of solutions. A principled total order of the full set 
of solutions S is now obtained, in two stages. 

First, we place each solution into one of four classes, ordered by inter- 
pretability: two star solutions, one star solutions, unstarred solutions lying 
in the minimal set for S, and the rest. All solutions in a higher class are 
ranked ahead of all those in a lower class. 

Then, we order the solutions within each class by simplicity and accuracy, 
with our usual bias toward the former, as follows. Having found the minimal 
set for a given class, we give its solutions the highest available rankings, 
ordering them by compl(S) [any ties being broken by discr(S)} — in other 
words, working from right to left in the accuracy-simplicity scatterplot. We 
now remove this minimal set from the class and repeat the ranking procedure 
on the remainder until all solutions in the class have been ranked. 

Tabular information for each solution is presented to the user in the re- 
sulting total order. For visual clarity, solutions in the lowest class — those 
neither starred, nor minimal for S — are not numbered in the scatterplot. 

For example, for the exams data, the two star class comprises Si and 52 
(shown in Table 6), reflecting the fact that they are considered top in terms 
of interpretability. Both are of type A. Between them, Si is ranked higher 
as it dominates S2, having the same overall complexity but a much better 
minimum accuracy attained: 0.937 compared to 0.729 (for the fourth axis in 
both cases). Indeed, the corresponding angle for this axis being some 43°, it 
seems likely that many users will rule out S2 as being insufficiently accurate. 

The one star class comprises solutions 5*3 to Sn. Among them, solutions 
S^-Sq have the highest rankings since they form the corresponding minimal 
set — that is, within this class, it is not possible to improve on either overall 
simplicity or accuracy without doing worse on the other criterion. They are 
ranked by overall simplicity [small values of compl(S)]. Removing them from 
the class and continuing, the new (ordered) minimal classes are SV to Sg and, 
finally, Sio and Su. For the exams data, S12 is the only unstarred solution 
in the minimal set for S, while there are no solutions in the lowest class. 

2.5. Running example: Summary comparison of results. We summarize 
here our results for the exam data running example (Section 1.4) displayed 
in Figure 3, whose terminology is explained above, comparing them with 
those of Rousson and Gasser (2004) and Vines (2000). 

Overall, the user is referred first to Si, shown in the left part of Ta- 
ble 6. This effectively combines simplicity, accuracy and subject matter in- 
terpretability, this latter being particularly straightforward: 

6i\: Represents overall mathematical ability. 

6t2'- Contrasts closed- and open-book exam performance, omitting Algebra. 
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Fig. 4. Scatterplot matrix of the simplified principal components for the exams data. 



03: Contrasts performance in the two closed-book exams, Mechanics and 
Vectors. 

04: Contrasts performance in the two open-book exams, Analysis and 
Statistics, included in &2- 

05: Contrasts Algebra with all other subjects. 

Figure 4 shows the scatterplot matrix for Si, the visualization and di- 
mension reduction features offered by such orthogonal-axis plots only being 
guaranteed with rotation approaches such as ours. The performance of each 
student on each of these five readily interpreted axes is visible. In particular, 
two or three students stand out at either extreme of overall mathematical 
ability, these students having very similar open- and closed-book perfor- 
mances as measured by &2- Again, we can see at once that there are no great 
correlations induced by this simplification — in fact, the two largest absolute 
correlations between the simple components are about 0.2, for (011,0:5) and 
(01,04), respectively. Overall, the scatterplot matrix is visually close to the 
one given by an exact principal component analysis, but much more inter- 
pretable. 
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Skipping S2, the two best one star solutions, 5*3 and S4, again both of 
type A, are shown in Table 7. They remain clearly interpret able, having 
greater overall simplicity than S\ and comparable accuracies to it (indeed, 
S4 dominates Si). In particular, the first two simple components comprise 
an overall mean and an open-closed book contrast, S3 using a weighted 
mean and S4 a weighted contrast. Between them, our automated bias toward 
simplicity puts S3 first. In it, all other contrasts are either within closed- 
book exams (£3) or within open-book exams (£4 and Z5). Overall, S3 and S4 
provide helpful, alternative views of the same data. As noted above (Section 
2.4.4), Rousson and Gasser (2004) will not report such one star solutions. 

The default version of Rousson and Gasser (2004) estimates one block 
to be appropriate for this data set. Our Si solution coincides with their 
corresponding optimal 6=1 fit, providing empirical support for its implicit 
model (see Section 2.4.4). Although orthogonal, their optimal b = 2 fit does 
not appear among our solutions, adding further empirical evidence that a 
two block model is not appropriate for these data. 

The method of Vines (2000) with associated parameter c = produces the 
same first and third components as our Si. Its other components differ and 
are somewhat harder to interpret, the highest complexity (11 for component 
4) exceeding N* = 9. 

3. Further examples. 

3.1. Reflexes data. The reflexes data, taken from Section 3.8.1 of Jol- 
liffe (2002), comprise measurements on 143 individuals of left and right re- 
flexes for five parts of the body, three in the upper limb and two in the 
lower. 

A principal component analysis of the correlation matrix is reported in 
Table 8. This brings out some of the structure in the data. It also provides 
a further example of the appropriateness of taking equal scientific interest 
in all the components. 

The dominant component is an overall mean, while components 2-5 con- 
trast reflexes in different parts of the body. Smaller components mainly 
contrast reflexes on the left and right sides of the body, the substantially 
smaller variances associated with them suggesting near constant linear rela- 
tionships. However, more detailed interpretation of the principal components 
is not immediate. For example, interpretation of the first principal compo- 
nent is impaired by variability in the loadings, notably the relatively small 
ones allocated to the two ankle measurements. 

3.1.1. Results of our approach. Our approach provides six different solu- 
tions for these data. The corresponding accuracy-simplicity plot (Figure 5) 
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Table 8 

Exact PC A loadings (rounded to 2 decimal places) for the reflexes data 



Variable 


qi 


Q2 


Q3 


q4 






qr 


qs 


qg 


qio 


triceps. R 





35 


-0 


18 





18 





49 


-0 


27 


-0 


06 


-0 


05 





00 





10 


0.69 


triceps. L 





36 


-0 


19 





15 





47 


-0 


27 


-0 


02 


-0 


13 





01 


-0 


13 


-0.70 


biceps. R 





36 


-0 


13 


-0 


14 





04 





71 


-0 


50 


-0 


22 


-0 


03 


-0 


19 


0.04 


biceps. L 





39 


-0 


14 


-0 


09 





05 





41 





70 





35 





02 





19 


-0.03 


wrist. R 





34 


-0 


24 





14 


-0 


51 


-0 


16 


-0 


21 


-0 


13 


-0 


01 





67 


-0.10 


wrist. L 





34 


-0 


22 





17 


-0 


52 


-0 


23 





11 





08 





03 


-0 


67 


0.12 


knee.R 





30 





29 


-0 


50 





02 


-0 


24 


-0 


35 





62 


-0 


02 





01 


-0.04 


knee.L 





27 





35 


-0 


54 


-0 


07 


-0 


18 





28 


-0 


63 





02 


-0 


02 


0.06 


ankle. R 





20 





53 





41 


-0 


03 





07 





03 





00 


-0 


71 


-0 


01 


-0.02 


ankle. L 





19 





54 





40 


-0 


02 





10 


-0 


04 





01 





70 





03 


-0.01 


Variance (%) 


52 


23 


20 


36 


10 


94 


8 


57 


4 


96 


1 


08 





86 





59 





23 


0.19 



shows S\ and S2 with two stars, £3 and S4 with one star, S5 as an unstarred 
minimal solution, and one unlabeled 'other' solution Sq. 

The user is referred first to solution Si, shown in Table 9, which has the 
following clear interpretation. The dominant simple component &i is just 
the simple average of all the reflexes, while &2 contrasts those in upper and 
lower limbs. Again, 03 contrasts the two lower limb parts, while and 0-5 



♦ 


o 



Two star 
One star 
Minimal 
Other 



Q— 



S3 s 



0.05 
- Accuracy 



0.2 

Discrepancy - 



Fig. 5. Solution set S for the reflexes data. 
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Table 9 
Integer representations for Si 



Variable 


Zl 


Z2 


z 3 


Z4 


Z5 


z 6 


Z-7 


Z8 


z 9 


Zio 


triceps. R 


1 


2 




1 


1 










-1 


triceps. L 


1 


2 




1 


1 










1 


biceps. R 


1 


2 






-2 


-1 


1 








biceps. L 


1 


2 






-2 


1 


-1 








wrist. R 


1 


2 




-1 


1 








-1 




wrist. L 


1 


2 




-1 


1 








1 




knee.R 


1 


-3 


-1 






-1 


-1 








knee.L 


1 


-3 


-1 






1 


1 








ankle. R 


1 


-3 


1 










-1 






ankle. L 


1 


-3 


1 










1 






Accuracy 


0.98 


0.95 


0.92 


0.99 


0.91 


0.91 


0.91 


0.998 0.95 


0.98 


Variance (%) 


50.8 


20.6 


11.2 


8.6 


5.6 


1.1 


1.1 


0.6 


0.3 


0.2 


Notes: Reiiexes 


data. 


Empty 


entries 


mean 


zeroes. 













contrast the three upper limb parts: first, triceps with wrist; then, biceps 
with these two. Taking the near constant simple components in reverse order, 
«io to as suggest left-right symmetry in triceps, wrist and ankle respectively. 
Finally, taking 07 and a$ together as we may (they have essentially the 
same variance) , the two-dimensional subspace which they span suggests left- 
right symmetry in knees and biceps. This follows at once from considering 
their sum and difference, corresponding to a 45° rotation of axes within this 
subspace. 

The variance explained by the first five simple components here is 96.7% 
compared to 97.1% for the exact principal components, each being close 
to that of its optimal counterpart. There are three absolute correlations of 
about 0.3 [for (05,07), (06,09) and (07,09)], all others being appreciably 
smaller. 

The one star solution S3 is very close to S\ in Figure 5. Indeed, it differs 
from it only in cxq and 07 representing another rotation within their span, 
already interpreted above as suggesting left-right symmetry in knees and 
biceps. This time, at the cost of increasing the complexity of both 06 and 
07 by one, their accuracies improve to 0.95 and 0.97, respectively. This 
illustrates that subspace rotation can increase accuracy without changing 
overall interpretation. 

Although dominated by Si, the other two star solution S2 provides an 
interesting alternative view. It differs only on two components, both having 
very clear interpretations: 

&2- Contrasts upper and lower limbs, omitting biceps, using only ±1 
loadings. 
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05: Contrasts biceps with everything else, using only 1 and —4 loadings. 

The other, unstarred, minimal solution S5 (details not shown) is simple 
and accurate, but somewhat less clearly structured. Its a.4 and as to ctio 
agree exactly with S± , the sign pattern of a§ and a.7 also agreeing (each now 
has loadings of ±2). di is a weighted mean, omitting ankle, a.% also omits 
ankle, contrasting biceps with the other three body parts. 02 contrasts upper 
and lower limbs, omitting biceps. Finally, 03 also omit biceps, contrasting 
knee with the other three body parts. 

The other two solutions, S4 and Sq, are markedly less simple and accurate. 

3.1.2. Comparison with other approaches. We briefly compare our re- 
sults here with those of other methods. 

The comparison with Rousson and Gasser's approach for these data is, 
essentially, the same as it was for the running exams data (see the end 
of Section 2.5). The default version of Rousson and Gasser (2004) again 
estimates 6 = 1, our Si solution coinciding with their corresponding optimal 
fit, providing empirical support for its implicit model. Although orthogonal, 
their optimal 6 = 2 fit does not appear among our solutions, adding further 
empirical evidence that a two block model is not appropriate for these data. 

Table 10 shows the components obtained using the method of Vines (2000) 
with associated parameter c = 0. Compared to the original principal compo- 
nent analysis (Table 8), this gives a substantially simpler, more interpretable 
solution. It differs from Si, especially for middle components, but interest- 
ingly picks up the same simplified components &i, a^ and dio, interpreted 
above (this might, in part, be because Vines' method is able to seek simplifi- 
cations of components in a nonsequential fashion). Axes as and ag here have 



Table 10 

Integer representations for the reflexes data using Vines ' method with c~0 



Variable 


zi 


z 2 


23 


2 4 


2 s 


26 


27 


2s 


29 


ZlO 


triceps. R 


1 


1 


2 


1 


19 


-19 


19 


19 


-19 


-1 


triceps. L 


1 


1 


2 


1 


19 


-19 


19 


19 


-19 


1 


biceps. R 


1 


1 


-1 




-42 - 


-2479 


-42 


-42 


12 




biceps. L 


1 


1 


-1 




-40 


2561 


-40 


-40 


40 




wrist. R 


1 


1 


2 


-1 


18 


-19 


19 


18 - 


-2541 




wrist. L 


1 


1 


2 


-1 


20 


-19 


19 


20 


2503 




kncc.R 


1 


-1 


-9 




10 


-9 - 


-2512 


10 


-10 




kncc.L 


1 


-1 


-9 




8 


-9 


2530 


8 


-8 




ankle. R 


1 


-2 


6 




-5 


6 


-6 - 


-2529 


6 




ankle. L 


1 


-2 


6 




-7 


6 


-6 


2517 


6 




Accuracy 


0.98 


0.97 


0.99 


0.99 


0.97 


0.85 


0.89 


1.00 


0.95 


0.98 


Variance (%) 


50.8 


21.5 


11 


8.6 


5 


1.2 


0.99 


0.60 


0.30 


0.20 



Note: Empty entries mean zeroes. 
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Table 11 

Exact principal component analysis loadings (rounded to 2 decimal 
places) for the alate adelges data 



Variable 


CI 1 






Ml 


Length 


0.25 


0.03 


0.02 


0.07 


Width 


0.26 


0.07 


0.01 


0.10 


Forwing 


0.26 


0.03 


-0.05 


0.07 


Hinwing 


0.26 


0.09 


0.03 


0.00 


Antseg 1 


0.24 


-0.18 


0.04 


-0.01 


Antseg 2 


0.25 


-0.16 


0.00 


0.02 


Antseg 3 


0.23 


0.24 


0.05 


0.11 


Antseg 4 


0.24 


0.04 


0.16 


0.01 


Antseg 5 


0.25 


-0.03 


0.10 


-0.02 


Tarsus 3 


0.26 


0.01 


0.03 


0.18 


Tibia 3 


0.26 


0.03 


0.08 


0.20 


Femur 3 


0.26 


0.07 


0.12 


0.19 


Rostrum 


0.25 


-0.01 


0.07 


0.04 


Ovipositor 


0.20 


-0.40 


-0.02 


0.06 


Spiracles 


0.16 


-0.41 


-0.19 


-0.62 


Ov-spines 


0.11 


-0.55 


-0.15 


0.04 


Anal fold 


-0.19 


-0.35 


0.04 


0.49 


Ant-spines 


-0.13 


-0.20 


0.93 


-0.17 


Hooks 


0.20 


0.28 


0.05 


-0.45 


Variance (%) 


73.0 


12.5 


3.9 


2.6 



virtually the same accuracy as in S\ but are much more complex, illustrat- 
ing that our bias toward simplicity does not necessarily sacrifice accuracy. 
Indeed, having a dominant pair of elements of nearly equal size and opposite 
sign, 6>8 and dg are both angle-close to the corresponding axes in Si, inter- 
preted above as suggestive of left-right symmetry in the corresponding part 
of the body. By the same token, a§ and a.7 are also angle-close to suggesting 
corresponding left -right symmetries. The remaining axes, 02, 0:3 and ds, 
are more accurate, but less directly interpretable, than those in Si. 

3.2. Alate adelges data. These data consist of 19 anatomical measure- 
ments of 40 alate adelges (winged aphids), as reported in Jeffers (1967). The 
measurements taken on each aphid are its length and width, fore-wing and 
hind-wing lengths, 5 antennal segment lengths, 3 leg bone measurements, 
measurements of the rostrum and the ovipositor, anal fold, and counts of 
the number of spiracles, ovipositor spines, antennal spines and hind- wing 
hooks. 

Jeffers (1967) focuses attention on the k = 4 dominant eigenvectors of the 
correlation matrix shown in Table 11, these accounting for 92% of the total 
variability in the data. He interprets «i as a general index of size, and «2 
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to 04 as essentially measuring the number of ovipositor spines, of antennal 
spines and of spiracles, respectively. 

These tidy interpretations are not without difficulty. For a±, some later 
variables have notably smaller loadings, of both signs. For each other a r , the 
interpretation offered amounts to 'thresholding' (setting all smaller loadings 
to zero) at the maximum absolute value of q r . Whereas this looks quite 
reasonable for 03, it seems much less so for 0L2 and a^, these axes contain- 
ing a range of substantial loadings, some of comparable magnitude to their 
maximum. 

Inspection of the correlation matrix in Jeffers (1967) shows that, while 
positively correlated with each other, the two variables with negative load- 
ings on «i are, with one insignificant exception, negatively correlated with 
all the other variables. Indeed, a single negative (at —0.026, essentially zero) 
correlation remains when both their signs are reversed. Following Rousson 
and Gasser (2003), one strategy is to reverse these two signs, analyze the 
data in some way and then, to retain the interpretation of the original vari- 
ables, switch them back again. We call this process 'sign reversal.' 

We compare here three solutions for these data, detailed in Table 12: 

• Si, as defined above, 

• Si, the result of an Si analysis with sign reversal, and 

• Sue, & n optimal Rousson and Gasser (2004) fit with 6 = 1 and, again, 
sign reversal. 

Vines' method is not capable to produce any answer here, mainly due to the 
complexity of some of the approximate loading vectors growing far too big. 

Whereas none is ideal (in particular, there is substantial correlation in 
each, especially Si), these three solutions provide helpful, complementary 
views of these data. We discuss them in turn. 

As expected, given that qi is not single-signed, Si is unstarred. Neverthe- 
less, it is perhaps the most easily interpreted solution. It is the simplest and 
sparsest, all loadings being 0, 1 or —1. Its dominant component is the sim- 
ple average of all the variables, excluding the four count variables and anal 
fold. Its third component is the number of antennal spines. The other two 
components are simple contrasts, whose pattern of zeroes is consistent with 
thresholding at lower levels with only two exceptions (the last two loadings 
in a.2), these zeroes ensuring orthogonality. However, 612 is not very accurate 
and, indeed, explains less variance than 0:4. 

Si is the most accurate solution, the minimum accuracy being 0.92. Al- 
though also unstarred, it is perhaps the next most easily interpreted. It is 
nearly as simple and as sparse as Si. It has a comparable corrected sum 
of variances to the optimized Srg fit (94.1% compared to 94.5%), achieved 
despite having lower variances associated with the last two components, 
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Table 12 

Integer representations for the alate adelges data 



Variable 




S 


L 














Sr 


G 




7 i 




7<3 


7 1 


7 i 


7n 


7,n 


7, A 


7. i 




7n 


7 A 


Length 


1 








2 








1 




3 




Width 


1 








2 








1 




3 


1 


Forwing 


1 








2 








1 






1 


Hinwing 


1 








2 








1 




3 




Antspo" 1 










2 


1 






1 








Antseg 2 


1 








2 


1 






1 








Antseg 3 


1 


-1 






2 


-1 






1 


3 


3 


1 


Antseg 4 


1 








2 








1 




3 




Antseg 5 


1 








2 








1 




3 




Tarsus 3 


1 








2 


-1 




1 


1 




3 


1 


Tibia 3 


1 








2 






1 


1 




3 


1 


Femur 3 


1 








2 






1 


1 




3 


1 


Rostrum 


1 








2 








1 




3 




Ovipositor 


1 


1 






1 


2 






1 


-4 




1 


Spiracles 




1 




1 




2 


1 


-2 


1 


-4 - 


11 


-3 


Ov-spines 




1 






1 


2 






1 


-4 - 


11 


1 


Anal fold 




1 




-1 


-1 


2 




2 


-1 


-3 




3 


Ant-spines 






1 






1 


-2 


-1 


-1 


-3 


11 


-1 


Hooks 








1 


2 


-1 




-2 


1 


3 


3 


-3 


Accuracy 


0.93 


0.87 


0.93 


0.90 


0.97 


0.95 


0.92 


0.96 


0.98 


0.94 


0.75 


0.96 


Variance (%) 


63.5 


9.7 


5.3 


9.8 


69 


11.6 


6 


2.7 


70.2 


11.3 


7.8 


3 



Optimality (%) 86.9 
Max correl 0.83 
Note: Empty entries mean zeroes. 



94.1 
0.63 



94.5 
0.63 



consistent with their suggestion of underlying regularities. Compared to S±, 
its dominant component gains accuracy and variance explained, but is less 
easily interpreted. Finally, the pattern of zeroes in its other components is 
consistent with thresholding at yet lower levels with only one exception (for 
Tarsus 3 on d^), this nonzero loading ensuring orthogonality. 

A b = 1 solution such as Srg comes from fitting the following assumed 
form of two star solution to the (sign reversed) data: a simple arithmetic 
mean, plus a set of contrasts in each of which the nonzero elements comprise 
m times a value n, and n times a value — m. As happens here, these contrasts 
need not be orthogonal, so that Srg cannot appear among our solutions. 
Its dominant component fits well, having the highest accuracy and variance 
explained, while the zeroes in its second component are, without exception, 
consistent with the same lower thresholding as in S± . However, the other fits 
seem poor, a.3 having a particularly low accuracy, while is considerably 
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less sparse than in S± but without improving accuracy. Overall, despite drop- 
ping the orthogonality constraint, Srg comes third in terms of simplicity, 
accuracy and sparseness. A likely reason for this is that its assumed model 
seems, at most, appropriate to the first two components. 



3.3. Larger data sets. In this section we use simulated examples to give 
an idea of how our method behaves when the number of variables p grows. 
These examples also illustrate the secondary, initially surprising, fact that 
certain simple structures in the population principal components can be 
recovered using only information from the sample. Such behavior has been 
observed, for small dimensions, in one other simplification method: see Sun 
(2006). 

For p = 8, 16, 32, 64, 128 and 256, we simulated 100 data sets of size n 
from a p-variate, zero mean, normal distribution with covariance matrix of 
the following form. Its matrix Q pop of population eigenvectors is the particu- 
lar integer matrix with orthogonal columns Z pop detailed below, normalized 
to unit column length. Its spectrum has four reasonably well-separated dom- 
inant eigenvalues (16, 8, 4, 2), the rest being equal with sum 1. Thus, for each 
p, the first four population components explain 30/31 ~ 97% of total vari- 
ability, the corresponding four sample components being used as input data 
here in each case. Sampling variability was kept constant across different 
values of p in the sense that the ratio of the number of degrees of freedom 
in the centered data to that in Q pop was kept fixed at 8, giving n = \p — 3. 

We use the following two star structure for the population eigenaxes gener- 
ated by Z pop . A so-called Hadamard matrix of order p = 2 m can be obtained 
inductively using 



and 



for m > 1 



J pop 



J p°p 



(2 m ) 



(2) 



1 



,(2 r 



im- 

'pop' 

Z/om- 
pop \^ 



^pop 

z 



(2 



m—l\ 



pop^ 



For example, this gives 



-"pop 



(4) 



f 1 
1 

1 

Vi 



(axis-equivalent to that found in the blood flow data of Section 1.3.2). This 
structure is the opposite of sparse, having no zeroes. Instead, it has what 
Chipman and Gu (2005) call 'homogeneity' At the same time, it is ex- 
tremely simple. For any m, the A part of our overall complexity measure 
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7 




2 - 



16 32 64 128 256 

P 

Fig. 6. Mean computation times relative to the time for p= 8. 

(Section 2.4.1) takes its maximum value 1/2, so that compl(S pop ) = 1.5 if 
and only if Z pop has this Hadamard form. 

Figure 6 shows that the computation time required grows roughly linearly 
in p, which gives a good indication that the method is relatively quick when 
p < 256 and there is a simple structure in the sample eigenvectors. 

Figure 7 is an accuracy-simplicity scatterplot for the 100 simulated values 
of Si obtained with p = 32. The percentage of simulations with compl(Si) = 
1.5, corresponding to S\ having a Hadamard structure, is substantial. Over- 
all, this percentage was found to increase with p, as was the minimum ac- 
curacy attained. 

4. Discussion. Combining principles with pragmatism, a new approach 
and accompanying algorithm to interpret (a subset of) principal components 
have been presented and shown to work well on a range of examples. The 
key idea is to approximate each eigenvector involved by an integer vector 
close to it in angle terms, while keeping the size of its maximum element as 
low as possible. Requiring orthogonality, attractive visualization and dimen- 
sion reduction features of principal component analysis are retained. Being 
essentially exploratory, alternative views of the same data are provided in a 
clear, principled order. The user is then free to choose the set of solutions 
that best match his or her trade-off between simplicity and accuracy. Again, 
other things being equal, explicit models can be checked by seeing if their fits 
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Fig. 7. Accuracy and complexity for solution to simulated data when p = 32. 



occur in our exploratory analysis (Sections 2.5 and 3.1.2), while alternatives 
can be provided where preconceived models appear inappropriate (Section 
3.2). Although not directly targeted, sparsity can emerge where appropri- 
ate, as in the example in each of the three Sections just cited. Section 3.3 
gives some idea of our algorithm's performance in larger data sets, while also 
illustrating that sparsity is not always appropriate. Overall, this new tool 
adds to the applied statistician's armoury, effectively combining simplicity, 
retention of optimality and computational efficiency, while complementing 
existing methods. 

Although the examples given establish that our approach is useful in 
practice, an extensive simulation study is required to more fully explore its 
performance and to compare it with other simplification methods, such as 
those proposed by Rousson and Gasser, Chipman and Gu, and Vines. Such a 
simulation study would also provide further information about appropriate 
default values for the tuning parameters employed and help to identify pos- 
sible alternative measures of interpretability, simplicity and accuracy that 
both highlight the best solutions and most effectively indicate situations 
where simple structures are perhaps not there to be found. 

Any approach to interpreting principal components involves making spe- 
cific choices and an overall compromise between conflicting objectives. Vari- 
ants and extensions of the approach presented here meriting future study 
include: 
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• exploring the potential usefulness of sequences of approximation other 
than the four employed here (Section 2.1); more radical is the possibility 
of simplifying two or higher-dimensional subspaces of eigenvectors at each 
step; 

• varying the minimum accuracy required across eigenaxes, for example, to 
reflect situations where it is more important for some components to be 
approximated accurately than others (in particular, this may be useful in 
connection with the variant discussed next); 

• adapting it to reflect scientific contexts in which interest centers solely on, 
say, explaining variability; 

• trading off the benefits of orthogonality against the advantages of sepa- 
rately approximating each eigenaxis; 

• applying its ideas in other contexts, including Linear Discriminant Anal- 
ysis and Canonical Correlation Analysis. 

APPENDIX A: DISTANCE INTERPRETATION OF 

THE MINIMUM ACCURACY ATTAINED 

We show here that the minimum accuracy attained is a known, strictly de- 
creasing, function of a natural measure of distance between any two ordered 
sets of axes. 

For any two vectors x and y in MP with unit length, define the angle 
< 9 < vr between them by cos((9) = x T y and the following measure of dis- 
crepancy between the axes ±x and ±y which they generate: 

5(±x, ±y) := min{||u - v\\/y/2 : u G {x, -x}, v G {y, -y}}. 

Then, omitting the straightforward proof, we have that 

5(±x, ±y) = min{||x - y\\/V2, ||x + y\\/V2} = Vl-|cos(0)| 

is a distance function on the set of all axes in MP (i.e., is nonnegative, zero 
only when ±x = dby, symmetric and obeys the triangle inequality), the 
angle-accuracy attained measure being thus a strictly decreasing function 
of it, namely, 

|cos(0)| = l-5 2 (±x,±y). 

For any two ordered sets of axes ±X := (±x 1 | • • • |±x m ) and ±Y := (iyj 

• • • |±y m ) in MP, with ||x r || = ||y r || = 1 and corresponding angles < 9 r < 
7T given by cos(0 r ) = xJTy r (1 < r < m), define now the following overall 
discrepancy measure between them: 

A(±X, ±Y) := max{£(±x r , ±y r ) : 1 < r < m}. 
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Then, using the properties of <5(-, •) just established, and again omitting the 
straightforward proof, we have that 

A(±X, ±Y) = y/l - min{| cos(0 r )| : 1 < r < m} 

is a distance function on the set of all ordered sets of axes in R p , the minimum 
angle-accuracy attained measure being thus a strictly decreasing function of 
it, namely, 

min{| cos(0 r )| : 1 < r < m} = 1 - A 2 (±X,±Y). 

APPENDIX B: IMPLEMENTATION 

In Section B.l we define a key approximation to the solution of the prob- 
lem of minimizing accuracy without orthogonality restrictions, for a given 
complexity. In Section B.2 we outline approaches to the search for a r (8). 

A set of R routines implementing our approach is available from the au- 
thors upon request. 

B.l. JV-ratio simplification. For a given vector u G R^ (/ > 2) and given 
complexity N, we describe here an approximation to the solution of the 
problem of maximizing accu(£(u),£(z)) over z E subject to compl(z) = 
N. 

A necessary condition for z to be optimal is that u and z have the same 
signs, while the rank vector of |z| coincides with that of |u|. Thus, subsuming 
sign changes and a permutation as required, there is no loss in taking u% > 
v>2 > • • • > U[ > and restricting attention to integer vectors z such that z\ > 
z 2 > • ■ • > zi > 0, the corresponding inverse permutation and sign changes 
being applied at the end. 

The N -ratio simplification of u is defined as = (N, . . . , z^ N ^) T 

in which the {zi N ^} l r=2 are chosen so that each £ r := tan~ 1 (zr 7V ' ) /A r ) is as 
close as possible to ip r := tan _1 (A r ) where A r := u r /u\ (a final division by 
hcf (|z^ |) being left implicit). Explicitly, for each r = 2, . . . , I, defining l r as 
the integer part of A r A^ and < a r < ip r < (3 r < it /A by a r := tan _1 (/ r /A r ) 
and /3 r :=tan _1 ((/ r + l)/N), we put 

m , .(AT) . = f lr, if A < («r + /3r)/2, 

[ ' r ■ \l r + l, ifVr >(a r +p r )/2. 

The accuracy of this approximation comes from the fact that £(z^ N ^) = 
£(u) if and only if zi ■ /N = u r /u\ for each r = 2, . . . ,1. This is a very fast 
approximation since, reordering of elements apart, the computational effort 
involved is linear in I. 

iV-ratio simplification has the additional advantage that neighboring solu- 
tions close to z^) can also be obtained easily. Before permuting back to the 
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original order and restoring the signs, I — 1 alternative neighboring approx- 
imations z can be obtained by adjusting the entries of in the following 
way: z T = + 1 if Zr = l r and z r = Zr — 1 if Zr = lr + 1- 

B.2. Search for a r (0). For the first axis to be simplified, a\{9) is ap- 
proximated by the iV-ratio simplification of qi with the smallest N that 
satisfies the minimum accuracy required cos(0). For r > 2, the orthogonality 
restrictions need to be taken into account. Here, we search for a r {9) using 
a hybrid approach which takes the best solution out of the three different 
procedures described below (two in Section B.2.1 and one in Section B.2.2), 
as ranked first by the smallest value of N r (9) found, and then by accuracy. 

We denote by H r _i the matrix representing orthogonal projection onto 
M(ZiJ_ l ), the null space of Zj_ 1 , where Z r _i is any p x (r — 1) matrix 
whose columns are integer representations of the axes already simplified, 
&i,...,a r -i. As detailed in Section B.2.4 below, H r _i = H r _i/iV r _i for 
some known integer matrix H r _i and positive integer iV r _i. 

B.2.1. Algorithms based on convergence to orthogonality. We describe 
here two versions of an iterative algorithm to find an axis of minimal com- 
plexity that satisfies the orthogonality and minimum accuracy restrictions. 
Starting with N = 1, the algorithm works by first obtaining the iV-ratio 
simplification of = H r _iq r and then modifying it, directly controlling its 
complexity, while aiming to maintain accuracy and improving the degree to 
which the orthogonality conditions are met. 

The algorithm is based on the function < w(z) := accu(z, H r _iz) < 1 
which measures the closeness of l(z) to A/"(ZjLi ) , the orthogonality condi- 
tions being met if and only if w(z) = 1. 

The algorithm has three stages: 

Stage 1. [1] Compute z^ N \ the TV-ratio simplification of q^r. [1*] If w(z^- ) ) = 
1 and satisfies the minimum accuracy required, we take a r (9) to be 
l(z^) and the algorithm stops. If oj(z^) = 1, but z^ does not satisfy 
the minimum accuracy required, we update N ^— TV + 1 and return to [1]. 
Otherwise, ^(z^) < 1 and we move on to Stage 2. 

Stage 2. Construct a set of neighbor vectors Z C 1^ by increasing and 
decreasing one of the entries of by one unit (see Section B.l), identi- 
fying its (possibly empty) subset Z\ of vectors with cj(z) = 1. If there is a 
z G Z\ satisfying the minimum accuracy required, we take a r {&) to be the 
most accurate such vector and the algorithm stops. If there isazgZi, 
but no such vector satisfies the minimum accuracy required, we update 
N <r- N + 1 and return to [1]. Otherwise, w(z) < 1 for all z £ Z and we 
identify its (possibly empty) subset Z{9) of vectors satisfying the mini- 
mum accuracy required. If Z{9) is the empty set, we move on to Stage 3. 
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Otherwise, we set z' to be arg max{w(z) :z G Z(9)}. If uj(z') < uj(z^ n ^), 
we again move on to Stage 3. Otherwise, if ui(z') > uj(z( n )), we update 
z^) <— z* (defined below) and return to [1*]. We have two variants of this 
algorithm, corresponding to two different choices of z*: 

1. z* = z': this hungrily pursues orthogonality, at a potential loss of ac- 
curacy. 

2. z* = argmax{accu(q^-, z) : w(z) > ui(z^ N '),z G 2(9)}: this retains ac- 
curacy as much as possible, while improving the extent to which the 
orthogonality conditions are met. 

Stage 3. We construct a set of higher order neighbor vectors Z by moving 
more than one entry of z^ N ' in the direction defined by the integer vector 
H r _izW - iV r _izW. We then follow the same procedure as in Stage 2 
except that, if 2(9) is empty or w(z') < cj(z^), we now update N 
N + 1 and return to [1]. 

Remark 1. If we obtain a vector of complexity strictly bigger than 
the current of N, we do not consider it at that stage, but keep it for later 
feasibility, provided its complexity is not bigger than N* . 

Remark 2. It is easy to show that, for any z G TL^ 1 with H r _iz ^ P , 

accu(q^r,z) ||H r _iz|| 
accu(q^, H r _iz) ||z|| 

= accu(z, H r _iz) < 1, 

so that accu (q^r, H r _iz) > accu(q^r,z), equality holding if and only if z 
obeys the orthogonality conditions z = H r _iz. For any other z, projection 
strictly increases accuracy. Given the general trade-off between accuracy and 
simplicity, this suggests that projection tends to increase complexity. Ac- 
cordingly, there is a premium on algorithms, such as the one just described, 
which avoid projection per se. 

B.2.2. Algorithm based on exact orthogonality. The following algorithm 
ensures exact orthogonality at every step by restricting attention to axes 
of the form ^(O r _iy), y G Z(P~ r+1 ), where O r _i is a p x (p — r + 1) integer 
matrix whose columns form a basis of Af(ZiJ_i). The particular matrix O r _i 
used, which appears to work well, mitigates the fact that the complexity and 
accuracy of z = O r _iy are indirectly controlled; see Section B.2.3. 

Putting y* := (Oj_ 1 O r _i) _1 Oj_ 1 q r , O r _iy* = is the closest point to 
q r in AA(Zj_ 1 ). Whereas the elements of y* will not in general be integers, 
we may obtain an approximation a r (9) to a r (9) as follows: 
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1. Compute the set of integer vectors y C z( p_r+1 ) obtained by iV-ratio 
simplification of y*, together with their angle neighbors, for all N < N* . 

2. Obtain the set £(O r _i3 ; ) of all axes £(O r -i~y) with y G y, and find the 
minimum complexity N r (6) over all axes in this set which satisfy the 
minimum accuracy requirement cos(0). 

3. Call a r {9) the most accurate axis in £(O r _i^) with complexity N T {0). 

B.2.3. Choice of O r _i. The choice Oo = I p is clearly optimal. For r > 1, 
the choice of O r _i depends on an initial permutation of the rows of Z r _i — 
defined below — such that the first r — 1 are linearly independent, forming a 
nonsingular matrix Z a in the corresponding partition Zjl x = (ZjZ^). This 
permutation is inverted at the end to maintain the identity of the variables. 

Conformably partitioning uel p as u T = (uJuJ), u G J\f(Zj_ 1 ) when 
Zju a + Z^u fe = (0, . . . ,0) T . Equivalently, det(Z a )u a = - cof (Z a )Z^~u fe , where 
cof(Z a ) is the matrix of cofactors of Z a . Thus, 



/ -cof(Z fl )Z 6 T \ 
Url - v det(Z a )I p _ r+ J 



is an integer matrix whose columns form a basis of A/'(Zj_ 1 ). 

For any y G Z^ p_r+1 \ conformably partitioning z = O r _iy as z T = (zjzj) 
gives £(zb) =£(y). We choose the initial permutation of the rows of Z r _i so 
that the elements of corresponding to zj, have the largest possible set 
of absolute values, these contributing most to angle-accuracy. Specifically, 
we proceed as follows. First, permute the elements of so that the ab- 
solute values of its elements are in increasing order, permuting the rows of 
Z r _i accordingly. Find the first set of r — 1 rows of Z r _i having nonzero 
determinant in the lexicographical ordering of such sets by their row labels. 
Finally, maintaining the internal ordering of these rows (and of their p — r + 1 
complementary rows), make them the first r — 1 rows, Z a , of a new matrix 
Z r -i- 

B.2.4. Construction of the projector H r _i. The matrix H r _i is propor- 
tional to an integer matrix, so that H r _i = H r _i/iV r _i for some integer 
matrix H r _i and positive integer N r -\. Simple updates are available to 
construct this matrix. 

Putting Nq = \ and Ho = Ho = I p , for each r > 1, H r = H r _i — z r zj/||z r || 2 , 
so that H r = H r /N r with H r = [||z r || 2 H r _i — N r -iz r zJ]/h r and N r = [N r -i, x 
||z r || 2 ]//i r , in which h r = hcf (N r _i ||z r || 2 ). The simplicity of these updates is 
another advantage of requiring orthogonality. 
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